Hi there, my name is Sujith Pai. I hold two degrees in engineering, one in mechanical engineering and one in aerospace engineering, from the University of California, Irvine. I am currently pursuing a master's in mechanical engineering at the University of Dayton, Ohio. This is my final project for Math 543.
RESEARCH INTERESTS
Combustion
Internal combustion engines
Designing of internal combustion engines
PROFESSIONAL EXPERIENCE
Canara springs, Karnataka, India
Techno marine, UAE
CONTACT
paisujith@yahoo.com
9493446755
Sujith Pai
In today’s fast-moving world we are more focused on making life easier for human beings and less focused on the impact of our actions. A dataset obtained from the UCI Machine Learning Repository is used to build a linear model to predict the toxicity of chemicals in water towards the fathead minnow. First the data is diagnosed to make sure that the model assumptions are met. Collinearity is suspected for two of the regressors, NdsCH and NdssC. These count variables are recoded as presence/absence using the ifelse function to improve performance. One of the variables is dropped after analyzing all possible models using the regsubsets function. The final model explains roughly 55% of the variance in LC50.
# A tibble: 6 x 7
X1 X2 X3 X4 X5 X6 Y
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 3.26 0.829 1.68 0 1 1.45 3.77
2 2.19 0.580 0.863 0 0 1.35 3.12
3 2.12 0.638 0.831 0 0 1.35 3.53
4 3.03 0.331 1.47 1 0 1.81 3.51
5 2.09 0.827 0.86 0 0 1.89 5.39
6 3.22 0.331 2.18 0 0 0.706 1.82
In today’s fast-moving world we are more focused on making life easier for human beings and less focused on the impact of our actions. Several of our rivers and water sources are highly polluted with plastics and chemicals, making these water bodies uninhabitable for fish and other aquatic life. The dataset used here comes from the UCI Machine Learning Repository and was collected to predict acute aquatic toxicity towards the fish Pimephales promelas (fathead minnow) on a set of 908 chemicals. Fathead minnows are very tolerant of a wide range of conditions in both water clarity and pH. The largest populations are found in streams or bog ponds where the conditions are rather poor for most other species of fish, which is why this particular species has been used for this experiment (2). Coincidentally, the highest population of the fathead minnow is in Ohio, USA. The fathead minnow also produces many offspring, which is beneficial to our study. LC50, the concentration that causes death in 50% of test fish over a test duration of 96 hours, was used as the model response. The model comprises 6 molecular descriptors: MLOGP (molecular properties), CIC0 (information indices), GATS1i (2D autocorrelations), NdssC (atom-type counts), NdsCH (atom-type counts) and SM1_Dz(Z) (2D matrix-based descriptors). The goal of the study is to accurately predict LC50, which indicates how toxic a chemical in the water is to the fish. The first six rows of the dataset are shown above.
Pimephales promelas
# A tibble: 6 x 7
CIC0 SM1_Dz GATS1i NdsCH NdssC MLOGP LC50
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 3.26 0.829 1.68 0 1 1.45 3.77
2 2.19 0.580 0.863 0 0 1.35 3.12
3 2.12 0.638 0.831 0 0 1.35 3.53
4 3.03 0.331 1.47 1 0 1.81 3.51
5 2.09 0.827 0.86 0 0 1.89 5.39
6 3.22 0.331 2.18 0 0 0.706 1.82
I started my data exploration by removing duplicate rows with the unique function. Only one row was duplicated, leaving 907 observations instead of 908. I also renamed the columns with the colnames function so that it is easier to see what each regressor stands for.
Below are the counts of each value (0 to 6) that occurs in the NdsCH and NdssC columns, respectively.
0 1 2 3 4
759 107 29 5 7
0 1 2 3 4 5 6
621 176 81 18 8 1 2
First we check whether a linear model is appropriate for this data. I diagnosed the dataset using several diagnostic plots. I fitted a linear model with all of the regressors and LC50 as the response, then examined the residuals-versus-fitted pattern. The plot looks good: the smoothed line runs roughly flat through the middle, indicating that the linearity assumption is reasonable for this dataset.
The QQ plot looks good, as most of the points lie on the 45-degree line. There is some deviation at the tails, but since this is real-world data with 907 observations, some deviation is expected. This indicates that the normality assumption is reasonably satisfied.
The equal-variance (scale-location) plot does not look as good: the values clump near the center and are not as evenly spread as we would like. The model can still be considered viable, but this needs to be investigated further. There might be some collinearity between the variables.
This plot is used to assess Cook’s distance, which helps identify influential observations in the dataset. Three observations (260, 373 and 448) are flagged, a negligible number in a dataset of 907, and they do not appear to have a major effect on the performance of the model.
The initial summary table showed an R^2 value of 49%. To improve the performance of the model, NdsCH and NdssC values of 1 or more were all assigned the value 1, so each variable now records only presence (1) or absence (0). The reasoning is that toxicity is determined by the presence of these chemical groups rather than by the number of atoms. The plot on the left shows CIC0 on the x axis and LC50 on the y axis, with points colored by NdsCH. Without collinearity we would expect separate patterns with little to no overlap. Here the overlap is severe, which needs to be investigated further. There is a possibility that NdsCH is collinear with CIC0.
The plot on the left shows SM1_Dz on the x axis and LC50 on the y axis, with points colored by NdssC. The results are troublesome here as well: there are no separate patterns, and there appears to be collinearity that we cannot ignore. Double-clicking the 0 entry in the legend at the top right of the interactive plot isolates the observations with value 0, which gives a better sense of the spread.
Reordering variables and trying again:
[1] 5
To find the best-performing model we have to analyze all possible combinations of the 6 variables. To do this we use the regsubsets function, and the result is displayed on the graph to the left. Using the which.max function we find the number of variables that gives the highest adjusted R^2, which is 5. Looking at the graph, NdssC appears to hurt the adjusted R^2 rather than add to it. The selected model therefore includes only the CIC0, SM1_Dz, GATS1i, NdsCH and MLOGP regressors.
Call:
lm(formula = LC50 ~ CIC0 + SM1_Dz + GATS1i + NdsCH + MLOGP, data = fish)
Residuals:
Min 1Q Median 3Q Max
-3.1080 -0.3857 -0.0643 0.3524 3.8663
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.02043 0.02756 -0.741 0.459
CIC0 0.22203 0.03193 6.953 6.84e-12 ***
SM1_Dz 0.35222 0.02558 13.768 < 2e-16 ***
GATS1i -0.20989 0.02843 -7.382 3.55e-13 ***
NdsCH1 0.06480 0.05073 1.277 0.202
MLOGP 0.38262 0.03407 11.229 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.676 on 901 degrees of freedom
Multiple R-squared: 0.5456, Adjusted R-squared: 0.5431
F-statistic: 216.3 on 5 and 901 DF, p-value: < 2.2e-16
Dropping NdssC raised the R^2 value from 49% initially to 54.6% (adjusted 54.3%), which can be gleaned from the summary table on the left.
The goal of this study was to determine whether a linear model could be built from the provided dataset. We were successful in building a model that explains roughly 55% of the variance in LC50, which we consider adequate for a first real-world screening. In other words, researchers would be able to use this linear model to predict whether a chemical in a water body is toxic to the fathead minnow, given that the other variables are known, which could greatly improve the chances of survival for these fish. In conclusion, all models are wrong, but some models are useful. This model was limited by the small number of observations in the dataset. As more data is collected we will be able to increase the accuracy of this model and expand it to other fish as well.
1) UCI Machine Learning Repository
https://archive.ics.uci.edu/ml/index.php
2) ODNR Division of Wildlife
http://wildlife.ohiodnr.gov/species-and-habitats/species-guide-index/fish/fathead-minnow
3) Dr Tessa Chen, University of Dayton, Ohio.
---
title: "Build a linear model to determine the levels of toxicity of water towards the Fathead Minnow"
output:
  flexdashboard::flex_dashboard:
    storyboard: true
    theme: cerulean
    social: menu
    source: embed
---
```{r setup, include=FALSE}
# load necessary packages
library(ggplot2)
library(plotly)
library(plyr)
library(flexdashboard)
library(readr)
fish <- read_csv("C:/Users/sujit/Downloads/qsar_fish_toxicity.csv")
```
### Personal history
Hi there, my name is Sujith Pai. I hold two degrees in engineering, one in mechanical engineering and one in aerospace engineering, from the University of California, Irvine. I am currently pursuing a master's in mechanical engineering at the University of Dayton, Ohio. This is my final project for Math 543.
RESEARCH INTERESTS
Combustion
Internal combustion engines
Designing of internal combustion engines
PROFESSIONAL EXPERIENCE
Canara springs, Karnataka, India
Techno marine, UAE
CONTACT
paisujith@yahoo.com
9493446755
***
```{r , echo=FALSE, fig.cap="Sujith Pai", out.width = '100%'}
knitr::include_graphics("capture.png")
```
### Abstract
In today's fast-moving world we are more focused on making life easier for human beings and less focused on the impact of our actions. A dataset obtained from the UCI Machine Learning Repository is used to build a linear model to predict the toxicity of chemicals in water towards the fathead minnow. First the data is diagnosed to make sure that the model assumptions are met. Collinearity is suspected for two of the regressors, NdsCH and NdssC. These count variables are recoded as presence/absence using the ifelse function to improve performance. One of the variables is dropped after analyzing all possible models using the regsubsets function. The final model explains roughly 55% of the variance in LC50.
### Introduction
```{r}
head(fish)
```
In today's fast-moving world we are more focused on making life easier for human beings and less focused on the impact of our actions. Several of our rivers and water sources are highly polluted with plastics and chemicals, making these water bodies uninhabitable for fish and other aquatic life. The dataset used here comes from the UCI Machine Learning Repository and was collected to predict acute aquatic toxicity towards the fish Pimephales promelas (fathead minnow) on a set of 908 chemicals. Fathead minnows are very tolerant of a wide range of conditions in both water clarity and pH. The largest populations are found in streams or bog ponds where the conditions are rather poor for most other species of fish, which is why this particular species has been used for this experiment (2). Coincidentally, the highest population of the fathead minnow is in Ohio, USA. The fathead minnow also produces many offspring, which is beneficial to our study. LC50, the concentration that causes death in 50% of test fish over a test duration of 96 hours, was used as the model response. The model comprises 6 molecular descriptors: MLOGP (molecular properties), CIC0 (information indices), GATS1i (2D autocorrelations), NdssC (atom-type counts), NdsCH (atom-type counts) and SM1_Dz(Z) (2D matrix-based descriptors). The goal of the study is to accurately predict LC50, which indicates how toxic a chemical in the water is to the fish. The first six rows of the dataset are shown above.
```{r pressure, echo=FALSE, fig.cap="Pimephales promelas", out.width = '20%'}
knitr::include_graphics("download.png")
```
(4)
### Data Exploration
```{r}
colnames(fish) <- c("CIC0", "SM1_Dz", "GATS1i", "NdsCH", "NdssC", "MLOGP", "LC50")
fish <- unique(fish)
head(fish)
```
I started my data exploration by removing duplicate rows with the unique function. Only one row was duplicated, leaving 907 observations instead of 908. I also renamed the columns with the colnames function so that it is easier to see what each regressor stands for.
Below are the counts of each value (0 to 6) that occurs in the NdsCH and NdssC columns, respectively.
```{r}
table(fish$NdsCH)
table(fish$NdssC)
```
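The duplicate check and the value counts described above can be sketched on a small made-up data frame. The `toy` values below are hypothetical and only stand in for the real toxicity data; `duplicated()`, `unique()` and `table()` are base R.

```r
# Toy data frame standing in for the toxicity data (values are made up).
toy <- data.frame(
  CIC0  = c(3.26, 2.19, 2.19, 3.03),
  NdsCH = c(0, 0, 0, 1),
  LC50  = c(3.77, 3.12, 3.12, 3.51)
)

# duplicated() flags every row that repeats an earlier row.
n_dup <- sum(duplicated(toy))

# unique() keeps the first copy of each row and drops the repeats.
toy_unique <- unique(toy)

# table() counts how often each value of a count variable occurs.
counts <- table(toy_unique$NdsCH)
```

Comparing `nrow(toy)` with `nrow(toy_unique)` is the same check that showed 908 rows shrinking to 907 in the real data.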
### Linearity plot using all variables
```{r}
ak <- lm(LC50~CIC0+SM1_Dz+GATS1i+NdsCH+NdssC+MLOGP,fish)
plot(ak, 1, main = "RESIDUAL VS FITTED")
```
***
First we check whether a linear model is appropriate for this data. I diagnosed the dataset using several diagnostic plots. I fitted a linear model with all of the regressors and LC50 as the response, then examined the residuals-versus-fitted pattern. The plot looks good: the smoothed line runs roughly flat through the middle, indicating that the linearity assumption is reasonable for this dataset.
***
### Normality plot using all variables
```{r}
plot(ak,2, main = "QQ PLOT")
```
***
The QQ plot looks good, as most of the points lie on the 45-degree line. There is some deviation at the tails, but since this is real-world data with 907 observations, some deviation is expected. This indicates that the normality assumption is reasonably satisfied.
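A numerical check can complement the visual reading of the QQ plot. The sketch below is not part of the original analysis: it uses simulated data and the base R `shapiro.test` function, and the names `x`, `y`, `fit` and `sw` are illustrative only.

```r
set.seed(543)

# Simulated data with normally distributed errors (not the fish data).
x <- rnorm(200)
y <- 1 + 2 * x + rnorm(200)
fit <- lm(y ~ x)

# Shapiro-Wilk test on the residuals; the null hypothesis is normality.
sw <- shapiro.test(resid(fit))
sw$p.value
```

With a sample as large as 907, even small departures from normality can give a small p-value, so the QQ plot remains the main guide.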
***
### Equal variance plot using all variables
```{r}
plot(ak,3,main = "EQUAL VARIANCE", lty = "blank")
```
***
The equal-variance (scale-location) plot does not look as good: the values clump near the center and are not as evenly spread as we would like. The model can still be considered viable, but this needs to be investigated further. There might be some collinearity between the variables.
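One way to investigate the equal-variance assumption further is a Breusch-Pagan style test. The sketch below is an assumption on my part, not something the original analysis ran: it computes the studentized form of the statistic by hand in base R on simulated data whose error spread is deliberately made to grow with `x`.

```r
set.seed(543)
n <- 300
x <- runif(n, 1, 5)

# Error spread grows with x, so the variance is deliberately non-constant.
y <- 2 + 0.5 * x + rnorm(n, sd = 0.3 * x)
fit <- lm(y ~ x)

# Auxiliary regression: squared residuals on the same regressor.
e2 <- resid(fit)^2
aux <- lm(e2 ~ x)

# Test statistic = n * R^2 of the auxiliary regression, compared with a
# chi-squared distribution with one degree of freedom per regressor.
bp_stat <- n * summary(aux)$r.squared
bp_p <- pchisq(bp_stat, df = 1, lower.tail = FALSE)
```

A small `bp_p` suggests the residual variance changes with the regressors, which is what a funnel or clump in the scale-location plot hints at.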
***
### Residuals vs Leverage plot using all variables
```{r}
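# which = 5 draws residuals vs leverage with Cook's distance contours (which = 4 draws Cook's distance only)
plot(ak, 5, main = "RESIDUAL VS LEVERAGE")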
```
***
This plot is used to assess Cook's distance, which helps identify influential observations in the dataset. Three observations (260, 373 and 448) are flagged, a negligible number in a dataset of 907, and they do not appear to have a major effect on the performance of the model.
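Cook's distance can also be computed directly instead of being read off the plot. The sketch below uses simulated data with one planted influential point. The `4 / n` cutoff is a common rule of thumb that I am assuming here; it is not a threshold used in the original analysis.

```r
set.seed(543)

# 49 well-behaved points plus one with extreme leverage, far from the trend.
x <- c(rnorm(49), 8)
y <- c(2 * x[1:49] + rnorm(49), -10)
fit <- lm(y ~ x)

# One Cook's distance per observation.
d <- cooks.distance(fit)

# Observations above the rule-of-thumb cutoff.
flagged <- which(d > 4 / length(d))

# The three largest values, which are the points the diagnostic plot labels by default.
top3 <- order(d, decreasing = TRUE)[1:3]
```

Refitting the model without the flagged rows and comparing the coefficients shows whether those points actually change the conclusions.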
### Collinearity plot for NdsCH
```{r}
# Recode the two atom-count regressors as presence (1) / absence (0) factors
fish$NdsCH <- as.factor(ifelse(fish$NdsCH > 0, 1, 0))
fish$NdssC <- as.factor(ifelse(fish$NdssC > 0, 1, 0))
# Interactive scatter of LC50 against CIC0, coloured by NdsCH
pai <- plot_ly(fish, x = ~CIC0, y = ~LC50, color = ~NdsCH, type = "scatter", mode = "markers", marker = list(size = 15), colors = c("#FF0000", "#1403FF"))
pai
```
***
The initial summary table showed an R^2 value of 49%. To improve the performance of the model, NdsCH and NdssC values of 1 or more were all assigned the value 1, so each variable now records only presence (1) or absence (0). The reasoning is that toxicity is determined by the presence of these chemical groups rather than by the number of atoms. The plot on the left shows CIC0 on the x axis and LC50 on the y axis, with points colored by NdsCH. Without collinearity we would expect separate patterns with little to no overlap. Here the overlap is severe, which needs to be investigated further. There is a possibility that NdsCH is collinear with CIC0.
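The recoding step can be illustrated on a small vector. The `counts` values below are made up; the `> 0` threshold matches the one used in the chunk for this frame.

```r
# Hypothetical atom counts for six chemicals.
counts <- c(0, 1, 2, 0, 4, 3)

# Any count above zero becomes 1 (present); zero stays 0 (absent).
present <- factor(ifelse(counts > 0, 1, 0), levels = c(0, 1))

# Number of chemicals in each group.
table(present)
```

Storing the result as a factor makes `lm` treat it as a two-level categorical regressor, which is why the coefficient appears as `NdsCH1` in the summary.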
***
### Collinearity plot NdssC
```{r}
# NdssC was already recoded as a 0/1 factor in the previous chunk
# Interactive scatter of LC50 against SM1_Dz, coloured by NdssC
pai1 <- plot_ly(fish, x = ~SM1_Dz, y = ~LC50, color = ~NdssC, type = "scatter", mode = "markers", marker = list(size = 13), colors = c("#FF0000", "#20B2AA"))
pai1
```
***
The plot on the left shows SM1_Dz on the x axis and LC50 on the y axis, with points colored by NdssC. The results are troublesome here as well: there are no separate patterns, and there appears to be collinearity that we cannot ignore. Double-clicking the 0 entry in the legend at the top right of the interactive plot isolates the observations with value 0, which gives a better sense of the spread.
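Collinearity among regressors can also be quantified with variance inflation factors (VIF). This is a swapped-in technique, not one the original analysis used. The sketch computes VIF by hand in base R on simulated regressors, where `x2` is deliberately built from `x1`; all names are illustrative.

```r
set.seed(543)
n <- 200
x1 <- rnorm(n)
x2 <- 0.9 * x1 + rnorm(n, sd = 0.3)  # built to be strongly related to x1
x3 <- rnorm(n)                       # unrelated to the others
X <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
# regressor j on all of the other regressors.
vif <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
  1 / (1 - r2)
})
vif
```

Values near 1 indicate little collinearity. Note that the response does not enter the calculation: collinearity is a property of the regressors alone.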
***
### Variable Selection
```{r}
library(leaps)
best <- regsubsets(LC50~., fish, nbest=1, nvmax=NULL, force.in=NULL, force.out = NULL, method="exhaustive" )
result <- summary(best)
plot(best, scale = "adjr2", main = "Adjusted R^2", col='pink' )
which.max(result$adjr2)
fish[,c(1,2,3,6,7)] <- apply(fish[,c(1,2,3,6,7)], 2, scale)
best.model <- lm(LC50~CIC0+SM1_Dz+GATS1i+NdsCH+MLOGP,
fish)
```
***
To find the best-performing model we have to analyze all possible combinations of the 6 variables. To do this we use the regsubsets function, and the result is displayed on the graph to the left. Using the which.max function we find the number of variables that gives the highest adjusted R^2, which is 5. Looking at the graph, NdssC appears to hurt the adjusted R^2 rather than add to it. The selected model therefore includes only the CIC0, SM1_Dz, GATS1i, NdsCH and MLOGP regressors.
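The idea behind the exhaustive search can be sketched in base R without the leaps package. This is an illustration on simulated data, not the `regsubsets` implementation: it fits every non-empty subset of three hypothetical predictors (`a`, `b`, `c`) and keeps the one with the highest adjusted R^2.

```r
set.seed(543)
n <- 150
dat <- data.frame(a = rnorm(n), b = rnorm(n), c = rnorm(n))
dat$y <- 1 + 2 * dat$a - dat$b + rnorm(n)  # c plays no role in y

predictors <- c("a", "b", "c")

# Enumerate every non-empty subset of the predictors.
subsets <- unlist(
  lapply(seq_along(predictors),
         function(k) combn(predictors, k, simplify = FALSE)),
  recursive = FALSE
)

# Fit each candidate model and record its adjusted R^2.
adj_r2 <- sapply(subsets, function(vars) {
  summary(lm(reformulate(vars, response = "y"), data = dat))$adj.r.squared
})

# The subset with the highest adjusted R^2.
best_vars <- subsets[[which.max(adj_r2)]]
```

With 6 regressors there are 63 candidate models, which is still cheap to enumerate. Adjusted R^2 is used rather than plain R^2 because plain R^2 never decreases when a variable is added.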
### Diagnostic plot for selected model
```{r}
plot(best.model, 1, main = "RESIDUAL VS FITTED")
plot(best.model,2, main = "QQ PLOT")
plot(best.model,3,main = "EQUAL VARIANCE", lty = "blank")
plot(best.model,4, main = "RESIDUAL VS LEVERAGE" )
```
***
These are the diagnostic plots with the NdssC variable removed. We see slight improvements in the scale-location plot: the points are more spread out and we do not see dark clumps as before. The residuals-versus-fitted curve lies closer to the dotted center line. The QQ plot appears unchanged. Let us now look at the performance of this model.
### Results
```{r}
ak1 <- lm(LC50~CIC0+SM1_Dz+GATS1i+NdsCH+MLOGP,fish)
summary(ak1)
```
***
Dropping NdssC raised the R^2 value from 49% initially to 54.6% (adjusted 54.3%), which can be gleaned from the summary table on the left.
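The adjusted R^2 in the summary table follows from the multiple R^2, the number of observations and the number of regressors. As a quick check, using the values reported by `summary` (0.5456, 907 observations, 5 regressors):

```r
r2 <- 0.5456  # Multiple R-squared from the summary table
n  <- 907     # observations after removing the duplicate row
p  <- 5       # regressors in the selected model

# Adjusted R^2 penalizes R^2 for the number of regressors.
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
round(adj_r2, 4)
```

The denominator `n - p - 1` equals 901, the residual degrees of freedom shown in the summary.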
***
### Conclusion
The goal of this study was to determine whether a linear model could be built from the provided dataset. We were successful in building a model that explains roughly 55% of the variance in LC50, which we consider adequate for a first real-world screening. In other words, researchers would be able to use this linear model to predict whether a chemical in a water body is toxic to the fathead minnow, given that the other variables are known, which could greatly improve the chances of survival for these fish. In conclusion, all models are wrong, but some models are useful. This model was limited by the small number of observations in the dataset. As more data is collected we will be able to increase the accuracy of this model and expand it to other fish as well.
### References
***
1) UCI Machine Learning Repository
https://archive.ics.uci.edu/ml/index.php
2) ODNR Division of Wildlife
http://wildlife.ohiodnr.gov/species-and-habitats/species-guide-index/fish/fathead-minnow
3) Dr Tessa Chen, University of Dayton, Ohio.
4) https://www.usgs.gov/media/images/fathead-minnow